Efficient pattern matching in degenerate strings with the Burrows-Wheeler transform
Authors
Abstract
A degenerate or indeterminate string on an alphabet Σ is a sequence of non-empty subsets of Σ. Given a degenerate string t of length n, we present a new method based on the Burrows–Wheeler transform for searching for a degenerate pattern of length m in t; it runs in O(mn) time on a constant-size alphabet Σ. Furthermore, it is a hybrid pattern-matching technique that works on both regular and degenerate strings. A degenerate string is said to be conservative if its number of non-solid letters is upper-bounded by a fixed positive constant q; in this case we show that the search time complexity is O(qm). Experimental results show that our method performs well in practice.
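To make the definitions in the abstract concrete, here is a minimal Python sketch of a degenerate string and of a naive O(mn) scan for a degenerate pattern. It is not the Burrows–Wheeler-based method of the paper, only a baseline illustration. It assumes the usual convention that two degenerate letters match when their subsets share at least one symbol; the bracket notation `a[bc]d` and the names `parse_degenerate`, `naive_degenerate_search` and `count_non_solid` are hypothetical and introduced here for illustration only.

```python
from typing import List, Set

# A degenerate string is modelled as a list of non-empty sets of symbols.
DegenerateString = List[Set[str]]


def parse_degenerate(s: str) -> DegenerateString:
    """Parse an illustrative notation such as 'a[bc]d' into [{'a'}, {'b', 'c'}, {'d'}]."""
    result: DegenerateString = []
    i = 0
    while i < len(s):
        if s[i] == "[":
            j = s.index("]", i)  # raises ValueError if the bracket is unclosed
            letters = set(s[i + 1:j])
            if not letters:
                raise ValueError("degenerate letters must be non-empty subsets")
            result.append(letters)
            i = j + 1
        else:
            result.append({s[i]})  # a solid letter is a singleton subset
            i += 1
    return result


def naive_degenerate_search(text: DegenerateString,
                            pattern: DegenerateString) -> List[int]:
    """Return every start position where the pattern matches the text.

    Assumption: two letters match when their subsets intersect.
    This baseline compares up to m letters at each of the n - m + 1 positions.
    """
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if all(text[i + j] & pattern[j] for j in range(m))
    ]


def count_non_solid(s: DegenerateString) -> int:
    """Count letters that contain more than one symbol (non-solid letters)."""
    return sum(1 for letters in s if len(letters) > 1)


if __name__ == "__main__":
    t = parse_degenerate("a[bc]ab[ac]")
    print(naive_degenerate_search(t, parse_degenerate("ab")))
    print(naive_degenerate_search(t, parse_degenerate("[ab]c")))
    print(count_non_solid(t))
```

In this sketch, a string is conservative in the abstract's sense when `count_non_solid` stays at or below the fixed constant q.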
Similar resources
Wheeler Graphs: Variations on a Theme by Burrows and Wheeler
The famous Burrows-Wheeler Transform was originally defined for single strings but variations have been developed for sets of strings, labelled trees, de Bruijn graphs, alignments, etc. In this talk we propose a unifying view that includes many of these variations and that we hope will simplify the search for more. Somewhat surprisingly we get our unifying view by considering the Nondeterminist...
Compressed-Domain Pattern Matching with the Burrows-Wheeler Transform
This report investigates two approaches for online pattern-matching in files compressed with the Burrows-Wheeler transform (Burrows & Wheeler 1994). The first is based on the Boyer-Moore pattern matching algorithm (Boyer & Moore 1977), and the second is based on binary search. The new methods use the special structure of the Burrows-Wheeler transform to achieve efficient, robust pattern matching...
A Text Transformation Scheme for Degenerate Strings
The Burrows-Wheeler Transformation computes a permutation of a string of letters over an alphabet, and is well-suited to compression-related applications due to its invertibility and data clustering properties. For space efficiency the input to the transform can be preprocessed into Lyndon factors. We consider scenarios with uncertainty regarding the data: a position in an indeterminate or degene...
Searching for Unique DNA Sequences with the Burrows-Wheeler Transform
The objective of this study was to present an efficient algorithm for searching for unique DNA sequences in a set of genes. The presented algorithm is based on the Burrows-Wheeler Transform (BWT), a very fast and effective data compression algorithm. The developed algorithm exploits all the advantages offered by the BWT algorithm and the suffix array data stru...
On the Massive String Problem
In this paper, we discuss an efficient and effective index mechanism to support the matching of massive pattern strings against a very long target string. This is very important to next-generation sequencing in biological research. The main idea behind it is to construct an automaton over all the pattern strings, and search the automaton against a BWT-array L created for a target strin...
Journal title:
- CoRR
Volume abs/1708.01130, Issue
Pages -
Publication date 2017